Social Media Text Processing and Semantic Analysis for Smart Cities

Author

João Filipe Figueiredo Pereira

Abstract

With the rise of social media, people obtain and share information almost instantly, around the clock. Many research areas have tried to gain valuable insights from these large volumes of freely available user-generated content, and intelligent transportation systems and smart cities are no exception. However, extracting meaningful and actionable knowledge from user-generated content is a complex endeavor. First, each social media service has its own data collection specificities and constraints. Second, the volume of messages and posts produced can be overwhelming for automatic processing and mining. Last but not least, social media texts are usually short and informal, with many abbreviations, jargon, slang and idioms. In this thesis, we try to tackle some of these challenges with the goal of extracting knowledge from social media streams that might be useful in the context of intelligent transportation systems and smart cities. We designed and developed a framework for the collection, processing and mining of geo-located tweets. More specifically, it provides functionality for parallel collection of geo-located tweets from multiple pre-defined bounding boxes (cities or regions), including filtering of non-complying tweets, text pre-processing for Portuguese and English, topic modeling, transportation-specific text classifiers, and aggregation and data visualization. We performed an extensive exploratory data analysis of geo-located tweets in five cities: Rio de Janeiro, São Paulo, New York City, London and Melbourne, comprising more than 43 million tweets over a period of 3 months. Furthermore, we performed a large-scale topic modeling comparison between Rio de Janeiro and São Paulo. As far as we know, this is the largest-scale content analysis of geo-located tweets from Brazil.
Interestingly, most of the topics are shared between the two cities, which, despite being in the same country, are considered very different in population, economy and lifestyle. We take advantage of recent developments in word embeddings and train such representations on the collections of geo-located tweets. We then use a combination of bag-of-embeddings and traditional bag-of-words features to train travel-related classifiers in both Portuguese and English that separate travel-related content from unrelated content. We created specific gold-standard data to evaluate the resulting classifiers empirically. The results are in line with research in other application areas, showing the robustness of using word embeddings to learn word similarities that bag-of-words is not able to capture. The source code and resources developed in this dissertation will be made publicly available to foster further developments by the research community in smart cities and intelligent transportation systems.
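As a rough illustration of what combining bag-of-words and bag-of-embeddings features can look like, the sketch below concatenates word counts with an averaged embedding vector. It is not the thesis implementation: the averaging, the concatenation, the function names, the vocabulary and the 2-dimensional toy embeddings are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy embeddings (2 dimensions). In practice these would be
# trained on the tweet collection; the values here are made up.
TOY_EMBEDDINGS = {
    "bus": [1.0, 0.0],
    "train": [0.8, 0.2],
    "late": [0.0, 1.0],
}

# Hypothetical bag-of-words vocabulary, in a fixed order.
VOCABULARY = ["bus", "train", "late", "coffee"]


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def bag_of_embeddings(tokens, embeddings, dim):
    """Average the embeddings of all known tokens (assumed pooling choice)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known token: fall back to a zero vector.
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def combined_features(text, vocabulary=VOCABULARY,
                      embeddings=TOY_EMBEDDINGS, dim=2):
    """Concatenate bag-of-words counts with the averaged embedding."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    return (bag_of_words(tokens, vocabulary)
            + bag_of_embeddings(tokens, embeddings, dim))
```

Calling `combined_features("Bus late late")` yields a 6-element vector: four counts followed by two embedding dimensions. Such a vector could then be passed to any standard classifier.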


Related resources

Developing Smart Cities Services through Semantic Analysis of Social Streams

This paper presents a domain-agnostic framework for intelligent processing of textual streams coming from social networks. The framework implements a pipeline of techniques for semantic representation, sentiment analysis, automatic content classification, and provides an analytics console to get some findings from the extracted data. The effectiveness of the platform has already been proved by ...


Investigating the mechanism of the void's physical-semantic effect on social interactions

The depth of the void concept has extended the range of its effects from philosophy to various sciences and even types of art. In architecture, due to the importance of the spacing effect and architectural components on behavior, void finds a different role that seems to be less addressed in contemporary architecture. If void, regardless of its hidden meaning, is referred to as "empty space," a...


Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...


Twitris: A System for Collective Social Intelligence

Social Media Analytics: the practice of gathering data from social media websites and analyzing that data to gain new insights and facilitate informed decisions and actions. Semantic Web: a group of methods and technologies to help machines and humans understand the meaning, or "semantics", of data on the World Wide Web. Spatio-Temporal-Thematic (STT) Analysis: Social media analytic...
